Biostat 203B Homework 2

Due Feb 7, 2025 @ 11:59PM

Author

Khoa Vu UID: 705600710

Display machine information for reproducibility:

sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.4.2    fastmap_1.2.0     cli_3.6.3        
 [5] tools_4.4.2       htmltools_0.5.8.1 rstudioapi_0.17.1 yaml_2.3.10      
 [9] rmarkdown_2.29    knitr_1.49        jsonlite_1.8.9    xfun_0.50        
[13] digest_0.6.37     rlang_1.1.4       evaluate_1.0.3   

Load necessary libraries (you can add more as needed).

library(arrow)

Attaching package: 'arrow'
The following object is masked from 'package:utils':

    timestamp
library(data.table)
library(duckdb)
Loading required package: DBI
library(memuse)
library(pryr)

Attaching package: 'pryr'
The following object is masked from 'package:data.table':

    address
library(R.utils)
Loading required package: R.oo
Loading required package: R.methodsS3
R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
R.oo v1.27.0 (2024-11-01 18:00:02 UTC) successfully loaded. See ?R.oo for help.

Attaching package: 'R.oo'
The following object is masked from 'package:R.methodsS3':

    throw
The following objects are masked from 'package:methods':

    getClasses, getMethods
The following objects are masked from 'package:base':

    attach, detach, load, save
R.utils v2.12.3 (2023-11-18 01:00:02 UTC) successfully loaded. See ?R.utils for help.

Attaching package: 'R.utils'
The following object is masked from 'package:arrow':

    timestamp
The following object is masked from 'package:utils':

    timestamp
The following objects are masked from 'package:base':

    cat, commandArgs, getOption, isOpen, nullfile, parse, use, warnings
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::between()      masks data.table::between()
✖ purrr::compose()      masks pryr::compose()
✖ lubridate::duration() masks arrow::duration()
✖ tidyr::extract()      masks R.utils::extract()
✖ dplyr::filter()       masks stats::filter()
✖ dplyr::first()        masks data.table::first()
✖ lubridate::hour()     masks data.table::hour()
✖ lubridate::isoweek()  masks data.table::isoweek()
✖ dplyr::lag()          masks stats::lag()
✖ dplyr::last()         masks data.table::last()
✖ lubridate::mday()     masks data.table::mday()
✖ lubridate::minute()   masks data.table::minute()
✖ lubridate::month()    masks data.table::month()
✖ purrr::partial()      masks pryr::partial()
✖ lubridate::quarter()  masks data.table::quarter()
✖ lubridate::second()   masks data.table::second()
✖ purrr::transpose()    masks data.table::transpose()
✖ lubridate::wday()     masks data.table::wday()
✖ lubridate::week()     masks data.table::week()
✖ dplyr::where()        masks pryr::where()
✖ lubridate::yday()     masks data.table::yday()
✖ lubridate::year()     masks data.table::year()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Display memory information of your computer

memuse::Sys.meminfo()
Totalram:  9.717 GiB 
Freeram:   5.982 GiB 

In this exercise, we explore various tools for ingesting the MIMIC-IV data introduced in homework 1.

Display the contents of MIMIC hosp and icu data folders:

ls -l ~/mimic/hosp/
total 24124664
-rwxrwxrwx 1 kvu1702 kvu1702    19928140 Jan 14 21:13 admissions.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702      427554 Jan 14 21:13 d_hcpcs.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702      876360 Jan 14 21:13 d_icd_diagnoses.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702      589186 Jan 14 21:13 d_icd_procedures.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702       13169 Jan 14 21:13 d_labitems.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702    33564802 Jan 14 21:13 diagnoses_icd.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702     9743908 Jan 14 21:13 drgcodes.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702   811305629 Jan 14 21:13 emar.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702   748158322 Jan 14 21:13 emar_detail.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702     2162335 Jan 14 21:13 hcpcsevents.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702        2907 Jan 14 21:13 index.html
-rwxrwxrwx 1 kvu1702 kvu1702 18402851720 Jan 14 21:13 labevents.csv
-rwxrwxrwx 1 kvu1702 kvu1702  2592909134 Jan 14 21:13 labevents.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702   117644075 Jan 14 21:14 microbiologyevents.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702    44069351 Jan 14 21:14 omr.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702     2835586 Jan 14 21:14 patients.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702   525708076 Jan 14 21:14 pharmacy.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702   666594177 Jan 14 21:14 poe.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702    55267894 Jan 14 21:14 poe_detail.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702   606298611 Jan 14 21:14 prescriptions.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702     7777324 Jan 14 21:14 procedures_icd.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702      127330 Jan 14 21:14 provider.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702     8569241 Jan 14 21:14 services.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702    46185771 Jan 14 21:14 transfers.csv.gz
ls -l ~/mimic/icu/
total 4253396
-rwxrwxrwx 1 kvu1702 kvu1702      41566 Jan 14 21:14 caregiver.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702 3502392765 Jan 14 21:14 chartevents.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702      58741 Jan 14 21:15 d_items.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702   63481196 Jan 14 21:15 datetimeevents.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702    3342355 Jan 14 21:15 icustays.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702       1336 Jan 14 21:15 index.html
-rwxrwxrwx 1 kvu1702 kvu1702  311642048 Jan 14 21:15 ingredientevents.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702  401088206 Jan 14 21:15 inputevents.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702   49307639 Jan 14 21:15 outputevents.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702   24096834 Jan 14 21:15 procedureevents.csv.gz

Q1. read.csv (base R) vs read_csv (tidyverse) vs fread (data.table)

Q1.1 Speed, memory, and data types

There are quite a few utilities in R for reading plain text data files. Let us test the speed of reading a moderate sized compressed csv file, admissions.csv.gz, by three functions: read.csv in base R, read_csv in tidyverse, and fread in the data.table package.

Which function is fastest? Is there difference in the (default) parsed data types? How much memory does each resultant dataframe or tibble use? (Hint: system.time measures run times; pryr::object_size measures memory usage; all these readers can take gz file as input without explicit decompression.)

Solution:

Testing the speed of read.csv:

system.time(read.csv("~/mimic/hosp/admissions.csv.gz"))
   user  system elapsed 
  9.087   0.221  10.077 
pryr::object_size(read.csv("~/mimic/hosp/admissions.csv.gz"))
200.10 MB

Testing the speed of read_csv:

system.time(read_csv("~/mimic/hosp/admissions.csv.gz"))
Rows: 546028 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl  (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
   user  system elapsed 
  1.704   0.333   1.463 
pryr::object_size(read_csv("~/mimic/hosp/admissions.csv.gz"))
Rows: 546028 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl  (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
70.02 MB

Testing the speed of fread:

system.time(fread("~/mimic/hosp/admissions.csv.gz"))
   user  system elapsed 
  1.368   0.071   0.935 
pryr::object_size(fread("~/mimic/hosp/admissions.csv.gz"))
63.47 MB

In order of decreasing speed, read.csv() was the slowest, taking ~10 seconds, followed by read_csv(), taking ~1.8 seconds. fread() was the fastest, taking ~1.3 seconds. In order of decreasing memory usage, read.csv() was the largest, taking ~200 MB, followed by read_csv(), taking up ~70 MB. fread() took the least amount of memory at ~63.5 MB.

Q1.2 User-supplied data types

Re-ingest admissions.csv.gz by indicating appropriate column data types in read_csv. Does the run time change? How much memory does the result tibble use? (Hint: col_types argument in read_csv.)

Solution:

#Note: we can use a compact string representation where each character represents one column:
#c = character
#i = integer
#n = number
#d = double
#l = logical
#f = factor
#D = date
#T = date time
#t = time
#? = guess
#From read_csv:
#subject_id --> INTEGER NOT NULL --> integer (i)
#hadm_id --> INTEGER NOT NULL --> integer (i)
#admittime --> TIMESTAMP NOT NULL --> data time (T)
#dischtime --> TIMESTAMP --> data time (T)
#deathtime --> TIMESTAMP --> data time (T)
#admission_type --> VARCHAR(40) NOT NULL --> character (c)
#admit_provider_id --> VARCHAR(10) --> character (c)
#admission_location --> VARCHAR(60) --> character (c)
#discharge_location --> VARCHAR(60) --> character (C)
#insurance --> VARCHAR(255) --> character (C)
#language --> VARCHAR(10) --> character (C)
#marital_status --> VARCHAR(30) --> character (C)
#race --> VARCHAR(80) --> character (C)
#edregtime --> TIMESTAMP --> data time (T)
#edouttime --> TIMESTAMP --> data time (T)
#hospital_expire_flag --> SMALLINT --> integer (i)
colum_data_types <- c("i", "i", "T", "T", "T", "i", "c", "c", "c", 
                      "c", "c", "c", "c", "T", "T", "i")
system.time(read_csv("~/mimic/hosp/admissions.csv.gz", 
                     col_types = colum_data_types))
   user  system elapsed 
  2.015   0.239   1.330 
pryr::object_size(read_csv("~/mimic/hosp/admissions.csv.gz", 
                     col_types = colum_data_types))
67.84 MB

When indicating the appropriate column types, the runtime increases to ~1.9 seconds, and the memory used increases to ~68 MB. ## Q2. Ingest big data files

Let us focus on a bigger file, labevents.csv.gz, which is about 130x bigger than admissions.csv.gz.

ls -l ~/mimic/hosp/labevents.csv.gz
-rwxrwxrwx 1 kvu1702 kvu1702 2592909134 Jan 14 21:13 /home/kvu1702/mimic/hosp/labevents.csv.gz

Display the first 10 lines of this file.

zcat < ~/mimic/hosp/labevents.csv.gz | head -10
labevent_id,subject_id,hadm_id,specimen_id,itemid,order_provider_id,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments
1,10000032,,2704548,50931,P69FQC,2180-03-23 11:51:00,2180-03-23 15:56:00,___,95,mg/dL,70,100,,ROUTINE,"IF FASTING, 70-100 NORMAL, >125 PROVISIONAL DIABETES."
2,10000032,,36092842,51071,P69FQC,2180-03-23 11:51:00,2180-03-23 16:00:00,NEG,,,,,,ROUTINE,
3,10000032,,36092842,51074,P69FQC,2180-03-23 11:51:00,2180-03-23 16:00:00,NEG,,,,,,ROUTINE,
4,10000032,,36092842,51075,P69FQC,2180-03-23 11:51:00,2180-03-23 16:00:00,NEG,,,,,,ROUTINE,"BENZODIAZEPINE IMMUNOASSAY SCREEN DOES NOT DETECT SOME DRUGS,;INCLUDING LORAZEPAM, CLONAZEPAM, AND FLUNITRAZEPAM."
5,10000032,,36092842,51079,P69FQC,2180-03-23 11:51:00,2180-03-23 16:00:00,NEG,,,,,,ROUTINE,
6,10000032,,36092842,51087,P69FQC,2180-03-23 11:51:00,,,,,,,,ROUTINE,RANDOM.
7,10000032,,36092842,51089,P69FQC,2180-03-23 11:51:00,2180-03-23 16:15:00,,,,,,,ROUTINE,PRESUMPTIVELY POSITIVE.
8,10000032,,36092842,51090,P69FQC,2180-03-23 11:51:00,2180-03-23 16:00:00,NEG,,,,,,ROUTINE,METHADONE ASSAY DETECTS ONLY METHADONE (NOT OTHER OPIATES/OPIOIDS).
9,10000032,,36092842,51092,P69FQC,2180-03-23 11:51:00,2180-03-23 16:00:00,NEG,,,,,,ROUTINE,"OPIATE IMMUNOASSAY SCREEN DOES NOT DETECT SYNTHETIC OPIOIDS;SUCH AS METHADONE, OXYCODONE, FENTANYL, BUPRENORPHINE, TRAMADOL,;NALOXONE, MEPERIDINE.  SEE ONLINE LAB MANUAL FOR DETAILS."

Q2.1 Ingest labevents.csv.gz by read_csv

Try to ingest labevents.csv.gz using read_csv. What happens? If it takes more than 3 minutes on your computer, then abort the program and report your findings.

Solution:

system.time(read_csv("~/mimic/hosp/labevents.csv.gz"))
pryr::object_size(read_csv("~/mimic/hosp/labevents.csv.gz"))

When trying to use read_csv to ingest labevents.csv.gz, my computer crashes.

Q2.2 Ingest selected columns of labevents.csv.gz by read_csv

Try to ingest only columns subject_id, itemid, charttime, and valuenum in labevents.csv.gz using read_csv. Does this solve the ingestion issue? (Hint: col_select argument in read_csv.)

Solution:

column_subset <- c("subject_id", "itemid", "charttime", "valuenum")
system.time(read_csv("~/mimic/hosp/labevents.csv.gz", 
                     col_select = column_subset))
pryr::object_size(read_csv("~/mimic/hosp/labevents.csv.gz", 
                     col_select = column_subset))

When trying to use read_csv to ingest only columns subject_id, itemid, charttime, and valuenum in labevents.csv.gz, my computer crashes still crashes, despite reading only a subset.

Q2.3 Ingest a subset of labevents.csv.gz

Our first strategy to handle this big data file is to make a subset of the labevents data. Read the MIMIC documentation for the content in data file labevents.csv.

In later exercises, we will only be interested in the following lab items: creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931) and the following columns: subject_id, itemid, charttime, valuenum. Write a Bash command to extract these columns and rows from labevents.csv.gz and save the result to a new file labevents_filtered.csv.gz in the current working directory. (Hint: Use zcat < to pipe the output of labevents.csv.gz to awk and then to gzip to compress the output. Do not put labevents_filtered.csv.gz in Git! To save render time, you can put #| eval: false at the beginning of this code chunk. TA will change it to #| eval: true before rendering your qmd file.)

Display the first 10 lines of the new file labevents_filtered.csv.gz. How many lines are in this new file, excluding the header? How long does it take read_csv to ingest labevents_filtered.csv.gz?

Solution:

#We use awk to select our columns of interest.
#`subject_id`, `itemid`, `charttime`, `valuenum` are columns 2, 5, 7, and 10 respectively
#and to select values of the following lab items: creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931) using itemid, column 5.

#Note: We use NR == 1 to skip the logic in the first line, since we want to keep the headers to read later on.
zcat < ~/mimic/hosp/labevents.csv.gz | 
awk -F ',' 'NR == 1 || $5 ~ /50912|50971|50983|50902|50882|51221|51301|50931/ \
{print $2 "," $5 "," $7 "," $10}' |  gzip > ~/labevents_filtered.csv.gz
echo 'The first ten lines in labevents_filtered.csv.gz are:'
zcat < ~/labevents_filtered.csv.gz | head -n 10

#To count the rows without the header, we pipe tail -n +2
echo 'The line number, minus the header, in labevents_filtered.csv.gz is:'
zcat < ~/labevents_filtered.csv.gz | tail -n +2 |wc -l
The first ten lines in labevents_filtered.csv.gz are:
subject_id,itemid,charttime,valuenum
10000032,50931,2180-03-23 11:51:00,95
10000032,50882,2180-03-23 11:51:00,27
10000032,50902,2180-03-23 11:51:00,101
10000032,50912,2180-03-23 11:51:00,0.4
10000032,50971,2180-03-23 11:51:00,3.7
10000032,50983,2180-03-23 11:51:00,136
10000032,51221,2180-03-23 11:51:00,45.4
10000032,51301,2180-03-23 11:51:00,3
10000032,51221,2180-05-06 22:25:00,42.6
The line number, minus the header, in labevents_filtered.csv.gz is:
32679896
system.time(fread("~/labevents_filtered.csv.gz"))
   user  system elapsed 
 14.800  15.343  12.559 
pryr::object_size(fread("~/labevents_filtered.csv.gz"))
784.32 MB

The number of lines in labevents_filtered.csv, minus the header, is 32679896 Reading the filtered dataset takes ~9 seconds and ~784 MB of memory.

Q2.4 Ingest labevents.csv by Apache Arrow

Our second strategy is to use Apache Arrow for larger-than-memory data analytics. Unfortunately Arrow does not work with gz files directly. First decompress labevents.csv.gz to labevents.csv and put it in the current working directory (do not add it in git!). To save render time, put #| eval: false at the beginning of this code chunk. TA will change it to #| eval: true when rendering your qmd file.

Then use arrow::open_dataset to ingest labevents.csv, select columns, and filter itemid as in Q2.3. How long does the ingest+select+filter process take? Display the number of rows and the first 10 rows of the result tibble, and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

Write a few sentences to explain what is Apache Arrow. Imagine you want to explain it to a layman in an elevator.

Solution:

#We first unzip labevents.csv.gz  
gzip -dk ~/mimic/hosp/labevents.csv.gz 
subset_columns <- c("subject_id", "itemid", "charttime", "valuenum")
subset_itemid <-c(50912, 50971, 50983, 50902, 
                   50882, 51221, 51301, 50931)

system.time(
  arrow::open_dataset("~/mimic/hosp/labevents.csv", format = "csv") %>%
  select(all_of(subset_columns)) %>%
  filter(itemid %in% subset_itemid) %>%
  collect()
  )
   user  system elapsed 
 47.752   6.961 117.073 
df <- arrow::open_dataset("~/mimic/hosp/labevents.csv", format = "csv") %>%
  select(all_of(subset_columns)) %>%
  filter(itemid %in% subset_itemid) %>%
  collect()

nrow(df)
[1] 32679896
head(df, 10)
# A tibble: 10 × 4
   subject_id itemid charttime           valuenum
        <int>  <int> <dttm>                 <dbl>
 1   10000032  50931 2180-03-23 04:51:00     95  
 2   10000032  50882 2180-03-23 04:51:00     27  
 3   10000032  50902 2180-03-23 04:51:00    101  
 4   10000032  50912 2180-03-23 04:51:00      0.4
 5   10000032  50971 2180-03-23 04:51:00      3.7
 6   10000032  50983 2180-03-23 04:51:00    136  
 7   10000032  51221 2180-03-23 04:51:00     45.4
 8   10000032  51301 2180-03-23 04:51:00      3  
 9   10000032  51221 2180-05-06 15:25:00     42.6
10   10000032  51301 2180-05-06 15:25:00      5  
#Clearing the df variable to save on memory
rm(list = ls())
gc()
          used (Mb) gc trigger  (Mb)  max used   (Mb)
Ncells 1831100 97.8    5045634 269.5   7883802  421.1
Vcells 5269364 40.3  130556756 996.1 136378951 1040.5

It takes ~3 minutes for the ingestion + selecting + filtering of labevents.csv

Apache Arrow is a software development platform built for high-performance applications involved in transporting and processing large data sets. Its in-memory columnar format holds language-independent specifications to structure table-like datasetsets. Apache Arrow has libraries implemented in various languages, including C, C++, Java, MATLAB, Python, R, and Julia.

Q2.5 Compress labevents.csv to Parquet format and ingest/select/filter

Re-write the csv file labevents.csv in the binary Parquet format (Hint: arrow::write_dataset.) How large is the Parquet file(s)? How long does the ingest+select+filter process of the Parquet file(s) take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

Write a few sentences to explain what is the Parquet format. Imagine you want to explain it to a layman in an elevator.

Solution:

#Writing the Parquet file
df <- arrow::open_dataset("~/mimic/hosp/labevents.csv", format = "csv")
arrow::write_dataset(df, path = "~/", format = "parquet")
#Clearing the df variable to save on memory
rm(list = ls())
gc()
          used  (Mb) gc trigger  (Mb)  max used   (Mb)
Ncells 1912669 102.2    5045634 269.5   7883802  421.1
Vcells 5412901  41.3  104445405 796.9 136378951 1040.5
#Reading the data set
subset_columns <- c("subject_id", "itemid", "charttime", "valuenum")
subset_itemid <-c(50912, 50971, 50983, 50902, 
                   50882, 51221, 51301, 50931)

system.time(
  arrow::open_dataset("~/part-0.parquet") %>%
  select(all_of(subset_columns)) %>%
  filter(itemid %in% subset_itemid) %>%
  collect()
  )
   user  system elapsed 
 27.186   5.929   7.608 
df <- arrow::open_dataset("~/part-0.parquet") %>%
  select(all_of(subset_columns)) %>%
  filter(itemid %in% subset_itemid) %>%
  collect()

nrow(df)
[1] 32679896
head(df, 10)
# A tibble: 10 × 4
   subject_id itemid charttime           valuenum
        <int>  <int> <dttm>                 <dbl>
 1   10000980  51301 2191-05-23 03:20:00      4.6
 2   10000980  50882 2191-05-23 22:45:00     25  
 3   10000980  50902 2191-05-23 22:45:00    108  
 4   10000980  50912 2191-05-23 22:45:00      2.1
 5   10000980  50931 2191-05-23 22:45:00    116  
 6   10000980  50971 2191-05-23 22:45:00      4  
 7   10000980  50983 2191-05-23 22:45:00    144  
 8   10000980  51221 2191-05-23 22:45:00     28  
 9   10000980  51301 2191-05-23 22:45:00      3.4
10   10000980  51221 2191-05-30 05:40:00     28.8
#Clearing the df variable to save on memory
rm(list = ls())
gc()
          used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells 1912677 102.2    5045634  269.5   7883802  421.1
Vcells 5412970  41.3  135050012 1030.4 168812514 1288.0

Ingesting, selecting, and filtering the Parquet file took ~8.5 seconds. The Parquet file is ~2.5 GB.

Optimized to handle flat columnar storage data formats, the Parquet open-source file format is very compatible with large-volume, complex data. Able to handle many encoding types, the Parquet file format is also known for its exemplary data compression ability. Using Google’s record shredding, Parquet files can perform fast queries that select specific columns without the need to read the entire data set and perform efficient column-wise compression.

Q2.6 DuckDB

Ingest the Parquet file, convert it to a DuckDB table by arrow::to_duckdb, select columns, and filter rows as in Q2.5. How long does the ingest+convert+select+filter process take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

Write a few sentences to explain what is DuckDB. Imagine you want to explain it to a layman in an elevator.

Solution:

subset_columns <- c("subject_id", "itemid", "charttime", "valuenum")
subset_itemid <-c(50912, 50971, 50983, 50902, 
                   50882, 51221, 51301, 50931)

system.time(arrow::open_dataset("~/part-0.parquet"))
   user  system elapsed 
  1.393   0.062   1.467 
df <- arrow::open_dataset("~/part-0.parquet")

system.time(  
  arrow::to_duckdb(df) %>%
  select(all_of(subset_columns)) %>%
  filter(itemid %in% subset_itemid) %>%
  collect()
)
   user  system elapsed 
 32.461  16.304  13.631 
df_duckdb <- arrow::to_duckdb(df) %>%
  select(all_of(subset_columns)) %>%
  filter(itemid %in% subset_itemid) %>%
  collect()

nrow(df_duckdb)
[1] 32679896
head(df_duckdb, 10)
# A tibble: 10 × 4
   subject_id itemid charttime           valuenum
        <dbl>  <dbl> <dttm>                 <dbl>
 1   10014610  51221 2173-12-21 05:13:00     27.4
 2   10014610  50882 2173-12-21 06:39:00     21  
 3   10014610  50902 2173-12-21 06:39:00    110  
 4   10014610  50912 2173-12-21 06:39:00      1  
 5   10014610  50931 2173-12-21 06:39:00    115  
 6   10014610  50971 2173-12-21 06:39:00      4.5
 7   10014610  50983 2173-12-21 06:39:00    139  
 8   10014610  51221 2173-12-22 02:03:00     23.5
 9   10014610  51301 2173-12-22 02:03:00      8  
10   10014610  50882 2173-12-22 02:03:00     25  
#Clearing the df variable to save on memory
rm(list = ls())
gc()
          used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells 2044180 109.2    5045634  269.5   7883802  421.1
Vcells 5669592  43.3  132608606 1011.8 168812514 1288.0

Solution: In total, it takes ~0.3 seconds to ingest the Parquet file and an extra ~6 seconds to convert, select, and filter the duckdb file. In total, the whole process took ~6.3 seconds.

DuckDB is a portable, analytical, in-process, and open-source database system. DuckDB uses a rich SQL dialect to read and write files in many supported formats and perform lightning-fast queries using its columnar engine, which supports parallel execution. Unlike other database systems, DuckDB is easy to install and runs in-process in many different host applications, such as Rstudio.

Q3. Ingest and filter chartevents.csv.gz

chartevents.csv.gz contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are

zcat < ~/mimic/icu/chartevents.csv.gz | head -10
subject_id,hadm_id,stay_id,caregiver_id,charttime,storetime,itemid,value,valuenum,valueuom,warning
10000032,29079034,39553978,18704,2180-07-23 12:36:00,2180-07-23 14:45:00,226512,39.4,39.4,kg,0
10000032,29079034,39553978,18704,2180-07-23 12:36:00,2180-07-23 14:45:00,226707,60,60,Inch,0
10000032,29079034,39553978,18704,2180-07-23 12:36:00,2180-07-23 14:45:00,226730,152,152,cm,0
10000032,29079034,39553978,18704,2180-07-23 14:00:00,2180-07-23 14:18:00,220048,SR (Sinus Rhythm),,,0
10000032,29079034,39553978,18704,2180-07-23 14:00:00,2180-07-23 14:18:00,224642,Oral,,,0
10000032,29079034,39553978,18704,2180-07-23 14:00:00,2180-07-23 14:18:00,224650,None,,,0
10000032,29079034,39553978,18704,2180-07-23 14:00:00,2180-07-23 14:20:00,223761,98.7,98.7,°F,0
10000032,29079034,39553978,18704,2180-07-23 14:11:00,2180-07-23 14:17:00,220179,84,84,mmHg,0
10000032,29079034,39553978,18704,2180-07-23 14:11:00,2180-07-23 14:17:00,220180,48,48,mmHg,0

How many rows? 433 millions.

zcat < ~/mimic/icu/chartevents.csv.gz | tail -n +2 | wc -l

d_items.csv.gz is the dictionary for the itemid in chartevents.csv.gz.

zcat < ~/mimic/icu/d_items.csv.gz | head -10
itemid,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue
220001,Problem List,Problem List,chartevents,General,,Text,,
220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,
220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,
220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,
220047,Heart Rate Alarm - Low,HR Alarm - Low,chartevents,Alarms,bpm,Numeric,,
220048,Heart Rhythm,Heart Rhythm,chartevents,Routine Vital Signs,,Text,,
220050,Arterial Blood Pressure systolic,ABPs,chartevents,Routine Vital Signs,mmHg,Numeric,90,140
220051,Arterial Blood Pressure diastolic,ABPd,chartevents,Routine Vital Signs,mmHg,Numeric,60,90
220052,Arterial Blood Pressure mean,ABPm,chartevents,Routine Vital Signs,mmHg,Numeric,,

In later exercises, we are interested in the vitals for ICU patients: heart rate (220045), mean non-invasive blood pressure (220181), systolic non-invasive blood pressure (220179), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items, using the favorite method you learnt in Q2.

Document the steps and show code. Display the number of rows and the first 10 rows of the result tibble.

Solution:

#Writing the Parquet file
df <- arrow::open_dataset("~/mimic/icu/chartevents.csv.gz", format = "csv")
arrow::write_dataset(df, path = "~/", format = "parquet")
#Clearing the df variable to save on memory
rm(list = ls())
gc()
          used  (Mb) gc trigger  (Mb)  max used   (Mb)
Ncells 2042977 109.2    5045634 269.5   7883802  421.1
Vcells 5669144  43.3  106086885 809.4 168812514 1288.0
subset_columns <- c("subject_id", "itemid", "charttime", "valuenum")
subset_itemid <-c(220045, 220181, 220179, 223761, 220210)
df <- arrow::open_dataset("~/part-0.parquet")
df_duckdb <- arrow::to_duckdb(df) %>%
  select(all_of(subset_columns)) %>%
  filter(itemid %in% subset_itemid) %>%
  collect()

nrow(df_duckdb)
[1] 30195426
head(df_duckdb, 10)
# A tibble: 10 × 4
   subject_id itemid charttime           valuenum
        <dbl>  <dbl> <dttm>                 <dbl>
 1   10004606 220179 2159-02-24 22:02:00    165  
 2   10004606 220181 2159-02-24 22:02:00    101  
 3   10004606 220045 2159-02-24 23:00:00     76  
 4   10004606 220210 2159-02-24 23:00:00     13  
 5   10004606 220179 2159-02-24 23:02:00    158  
 6   10004606 220181 2159-02-24 23:02:00     86  
 7   10004606 220045 2159-02-25 00:00:00     76  
 8   10004606 220210 2159-02-25 00:00:00     12  
 9   10004606 223761 2159-02-25 00:00:00     99.1
10   10004606 220179 2159-02-25 00:02:00    155  
#Clearing variables to save on ram
rm(list = ls())
gc()
          used  (Mb) gc trigger  (Mb)  max used   (Mb)
Ncells 2043740 109.2    5045634 269.5   7883802  421.1
Vcells 5669923  43.3  123068546 939.0 168812514 1288.0

The number of rows is 30195426.